Add conditional token filter to elasticsearch #31958

romseygeek · 2018-07-11T11:07:09Z

Lucene has a ConditionTokenFilter which allows you to selectively apply tokenfilters, depending on the state of the current token in the tokenstream. This commit exposes this functionality in elasticsearch, adding a new AnalysisPredicateScript context.
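For readers unfamiliar with the idea, here is a minimal sketch of conditional filtering in plain Java. It is not Lucene's `ConditionTokenFilter` API: tokens are modelled as plain strings, and the class and method names (`ConditionalFilterSketch`, `apply`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Simplified stand-in for a token stream: plain strings instead of Lucene attributes.
final class ConditionalFilterSketch {

    // Applies `filter` only to tokens for which `condition` is true;
    // every other token passes through unchanged.
    static List<String> apply(List<String> tokens,
                              Predicate<String> condition,
                              UnaryOperator<String> filter) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(condition.test(token) ? filter.apply(token) : token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> result = apply(
            Arrays.asList("What", "Flapdoodle"),
            t -> t.length() < 5,               // plays the role of the scripted predicate
            t -> t.toLowerCase(Locale.ROOT));  // plays the role of the wrapped filter
        System.out.println(result); // prints [what, Flapdoodle]
    }
}
```

Only the short token is lowercased; the long one is emitted as it came in.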

elasticmachine · 2018-07-11T11:07:47Z

Pinging @elastic/es-search-aggs

romseygeek · 2018-07-11T11:10:01Z

This is a first step: it adds the analysis script context and illustrates how to use it with the conditional token filter. I'd also like to look at adding scriptable filtering filters (i.e., 'ignore this token if it matches a predicate'), and possibly the ability to make changes to the token itself.

romseygeek · 2018-07-11T11:11:41Z

modules/lang-painless/src/main/resources/org/elasticsearch/painless/spi/org.elasticsearch.txt

+  int endOffset
+  String type
+  boolean isKeyword
+}


I'm exposing this information directly at the moment, but it might be better to expose getters instead so that consumers don't try and do things like change offsets or position increments

romseygeek · 2018-07-11T11:12:36Z

modules/lang-painless/src/test/java/org/elasticsearch/painless/AnalysisScriptTests.java

+
+    public void testAnalysisScript() {
+        AnalysisPredicateScript.Factory factory = scriptEngine.compile("test", "return \"one\".contentEquals(term.term)",
+            AnalysisPredicateScript.CONTEXT, Collections.emptyMap());


Having to use contentEquals is a bit trappy here as well, but I want to avoid creating Strings on every call to incrementToken().
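To illustrate the trap, here is a small sketch that assumes the term is exposed as a `CharSequence` rather than a `String` (a `StringBuilder` stands in for it). The class name is made up for the example.

```java
// Shows why equals() is a trap when the term is a CharSequence that is not a String.
final class ContentEqualsSketch {

    static boolean matchesWithEquals(CharSequence term) {
        // String.equals only returns true when the argument is itself a String.
        return "one".equals(term);
    }

    static boolean matchesWithContentEquals(CharSequence term) {
        // String.contentEquals compares the characters, so no String has to be created.
        return "one".contentEquals(term);
    }

    public static void main(String[] args) {
        CharSequence term = new StringBuilder("one");
        System.out.println(matchesWithEquals(term));        // prints false
        System.out.println(matchesWithContentEquals(term)); // prints true
    }
}
```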

jpountz

I left some comments but like it overall.

jpountz · 2018-07-11T12:34:48Z

modules/lang-painless/src/main/resources/org/elasticsearch/painless/spi/org.elasticsearch.txt

+  int endOffset
+  String type
+  boolean isKeyword
+}


+1 to exposing getters

jpountz · 2018-07-11T12:35:55Z

modules/lang-painless/src/main/resources/org/elasticsearch/painless/spi/org.elasticsearch.txt

+  int endOffset
+  String type
+  boolean isKeyword
+}


Also, should we expose the absolute position rather than the position increment? That looks more useful to me for filtering. Or both?
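As a rough sketch of the difference: an absolute position is the running sum of the position increments. The code below assumes the usual Lucene convention that the position before the first token is -1, so a first token with increment 1 lands at position 0. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class PositionSketch {

    // Converts per-token position increments into absolute positions.
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1; // assumed starting point before the first token
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // An increment of 2 models a gap (e.g. a removed stopword);
        // an increment of 0 models a token stacked on the previous one (e.g. a synonym).
        int[] increments = {1, 1, 2, 0};
        System.out.println(Arrays.toString(absolutePositions(increments))); // prints [0, 1, 3, 3]
    }
}
```

A predicate such as "only the first three positions" is easy to write against the absolute value, but needs this bookkeeping if only the increment is exposed.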

jpountz · 2018-07-11T12:36:59Z

modules/lang-painless/src/main/resources/org/elasticsearch/painless/spi/org.elasticsearch.txt

@@ -175,3 +175,13 @@ class org.elasticsearch.index.similarity.ScriptedSimilarity$Doc {
  int getLength()
  float getFreq()
 }
+
+class org.elasticsearch.script.AnalysisPredicateScript$Term {


should it be called Token rather than Term? It might be just me but I feel like it better carries the information that there are other attributes here like positions and offsets.

jpountz · 2018-07-11T12:40:55Z

...es/analysis-common/src/main/java/org/elasticsearch/analysis/common/CommonAnalysisPlugin.java

@@ -202,6 +222,8 @@
        filters.put("classic", ClassicFilterFactory::new);
        filters.put("czech_stem", CzechStemTokenFilterFactory::new);
        filters.put("common_grams", requriesAnalysisSettings(CommonGramsTokenFilterFactory::new));
+        filters.put("condition",
+            requriesAnalysisSettings((i, e, n, s) -> new ScriptedConditionTokenFilterFactory(i, n, s, scriptService.get())));


requires has a typo?

this is a pre-existing typo; I'll open a separate PR to fix it

jpountz · 2018-07-11T12:42:48Z

docs/reference/analysis/tokenfilters/condition-tokenfilter.asciidoc

+[float]
+=== Options
+[horizontal]
+filters:: a list of token filters to apply to the current token if the predicate


We should be explicit that these filters are chained. Maybe call it filter (no plural) for consistency with analyzers?

jpountz · 2018-07-11T12:43:54Z

...mon/src/main/java/org/elasticsearch/analysis/common/ScriptedConditionTokenFilterFactory.java

+        }
+        this.factory = scriptService.compile(script, AnalysisPredicateScript.CONTEXT);
+
+        this.filterNames = settings.getAsList("filters");


should we fail on an empty list?
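A minimal sketch of what failing fast could look like. The helper name and the error message are assumptions for illustration, not what the factory actually does.

```java
import java.util.Arrays;
import java.util.List;

final class FilterListValidationSketch {

    // Rejects a missing or empty filter list up front instead of
    // silently building a conditional filter that does nothing.
    static List<String> requireNonEmptyFilters(List<String> filterNames) {
        if (filterNames == null || filterNames.isEmpty()) {
            // Hypothetical message, chosen for the example only.
            throw new IllegalArgumentException("Empty list of filters provided to condition token filter");
        }
        return filterNames;
    }

    public static void main(String[] args) {
        System.out.println(requireNonEmptyFilters(Arrays.asList("lowercase"))); // prints [lowercase]
    }
}
```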

nik9000 · 2018-07-11T18:01:54Z

docs/reference/analysis/tokenfilters/condition-tokenfilter.asciidoc

+                    "type" : "condition",
+                    "filters" : [ "lowercase" ],
+                    "script" : {
+                        "source" : "return term.term.length() < 5"  <1>


I don't believe you need the return here and I think it'd read a little better without it.

nik9000 · 2018-07-11T18:09:40Z

...es/analysis-common/src/main/java/org/elasticsearch/analysis/common/CommonAnalysisPlugin.java

+
+    @Override
+    public Collection<Object> createComponents(Client client, ClusterService clusterService, ThreadPool threadPool, ResourceWatcherService resourceWatcherService, ScriptService scriptService, NamedXContentRegistry xContentRegistry, Environment environment, NodeEnvironment nodeEnvironment, NamedWriteableRegistry namedWriteableRegistry) {
+        this.scriptService.set(scriptService);


We're never very happy with using SetOnce like this. It gets the job done but it reeks of guice's "everything depends on everything"-ness that we've worked so hard to remove over the years. Not that I have anything better though.
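For context, the pattern being discussed looks roughly like the sketch below: the plugin is constructed before the service exists, so it keeps a holder that is filled in exactly once later and read when filters are built. This is a simplified stand-in written for this example, not Lucene's `SetOnce` class; in particular, the exception type is an assumption.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

// Minimal set-once holder: the value can be assigned a single time and read many times.
final class SetOnceSketch<T> {
    private final AtomicReference<T> ref = new AtomicReference<>();

    void set(T value) {
        Objects.requireNonNull(value, "value");
        // compareAndSet only succeeds while the holder is still empty.
        if (!ref.compareAndSet(null, value)) {
            throw new IllegalStateException("value was already set");
        }
    }

    // Returns null until set() has been called.
    T get() {
        return ref.get();
    }

    public static void main(String[] args) {
        SetOnceSketch<String> holder = new SetOnceSketch<>();
        holder.set("scriptService");      // e.g. done when components are created
        System.out.println(holder.get()); // prints scriptService
    }
}
```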

nik9000 · 2018-07-11T18:15:53Z

modules/lang-painless/src/main/resources/org/elasticsearch/painless/spi/org.elasticsearch.txt

@@ -175,3 +175,13 @@ class org.elasticsearch.index.similarity.ScriptedSimilarity$Doc {
  int getLength()
  float getFreq()
 }
+
+class org.elasticsearch.script.AnalysisPredicateScript$Term {


I think we should look at creating a new context and adding this to the whitelist of that context. See `plugins/example/painless-whitelist` for some of that.

nik9000 · 2018-07-11T18:17:00Z

modules/lang-painless/src/main/resources/org/elasticsearch/painless/spi/org.elasticsearch.txt

+  int endOffset
+  String type
+  boolean isKeyword
+}


Painless doesn't require the get part of the getter anyway so I'd prefer getters.
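A sketch of the getter-based shape, assuming a read-only view over the token state. The class name and constructor are hypothetical; only the field names are taken from the whitelist hunk above.

```java
// Read-only view of a token: state is private, and only getters are exposed,
// so a script cannot reassign offsets or other attributes.
final class TokenViewSketch {
    private final CharSequence term;
    private final int startOffset;
    private final int endOffset;
    private final String type;
    private final boolean keyword;

    TokenViewSketch(CharSequence term, int startOffset, int endOffset, String type, boolean keyword) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.type = type;
        this.keyword = keyword;
    }

    CharSequence getTerm() { return term; }
    int getStartOffset()   { return startOffset; }
    int getEndOffset()     { return endOffset; }
    String getType()       { return type; }
    boolean isKeyword()    { return keyword; }

    public static void main(String[] args) {
        TokenViewSketch token = new TokenViewSketch("one", 0, 3, "<ALPHANUM>", false);
        System.out.println(token.getTerm() + " " + token.getStartOffset() + "-" + token.getEndOffset());
        // prints one 0-3
    }
}
```

Since Painless lets a script drop the `get` prefix, a script could still read `token.term` while the Java side keeps the fields private.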

nik9000 · 2018-07-11T18:18:35Z

modules/lang-painless/src/test/java/org/elasticsearch/painless/AnalysisScriptTests.java

+import java.util.List;
+import java.util.Map;
+
+public class AnalysisScriptTests extends ScriptTestCase {


I think this should be in your plugin instead of painless if we can manage it.

I don't follow - do you mean moving the analysis script context out of core and into the common analysis plugin, or do you mean creating an entirely new plugin just for this token filter (and any other script-based filters we come up with)?

I mean I feel like it should live in the common analysis plugin and not in the server at all. Since it is a new thing I figure we should isolate it as much as we can so we know when stuff is using it. Not because it is dirty or anything, just because a smaller core is easier to reason about. And keeping this entirely within the analysis-common plugin helps to prove out the work that we've done for plugins extending painless.

The tricky part here will be unit testing, as ScriptTestCase is a painless-specific class. The painless extension module uses rest tests only, but I'd much rather have compiled unit tests here. Maybe we should build a separate painless-testing module, that can be used for tests elsewhere?

In general we try to unit test this sort of thing without painless and use a mock script engine instead that just calls code on the test class. It is totally reasonable to have a few integration tests that do use painless though.
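The mock-engine idea can be sketched as a lookup table from script source to plain Java code supplied by the test. This is an illustration only and not the API of Elasticsearch's test framework; `MockEngineSketch`, `register` and `compile` are made-up names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// A "script engine" for tests: instead of compiling the source,
// it returns the Java predicate the test registered under that source.
final class MockEngineSketch {
    private final Map<String, Predicate<CharSequence>> scripts = new HashMap<>();

    void register(String source, Predicate<CharSequence> implementation) {
        scripts.put(source, implementation);
    }

    Predicate<CharSequence> compile(String source) {
        Predicate<CharSequence> implementation = scripts.get(source);
        if (implementation == null) {
            throw new IllegalArgumentException("no mock registered for script [" + source + "]");
        }
        return implementation;
    }

    public static void main(String[] args) {
        MockEngineSketch engine = new MockEngineSketch();
        engine.register("token.term.length() < 5", term -> term.length() < 5);
        Predicate<CharSequence> predicate = engine.compile("token.term.length() < 5");
        System.out.println(predicate.test("what"));       // prints true
        System.out.println(predicate.test("flapdoodle")); // prints false
    }
}
```

The filter under test then runs against ordinary Java code, and Painless itself is only needed in a handful of integration tests.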

…2042) When a replica is fully recovered (i.e., in `POST_RECOVERY` state) we send a request to the master to start the shard. The master changes the state of the replica and publishes a cluster state to that effect. In certain cases, that cluster state can be processed on the node hosting the replica *together* with a cluster state that promotes that, now started, replica to a primary. This can happen due to batched cluster state processing or if the master died after having committed the cluster state that starts the shard but before publishing it to the node with the replica. If the master also held the primary shard, the new master node will remove the primary (as it failed) and will also immediately promote the replica (thinking it is started). Sadly, our code in IndexShard didn't allow for this, which caused [assertions](https://github.com/elastic/elasticsearch/blob/13917162ad5c59a96ccb4d6a81a5044546c45c22/server/src/main/java/org/elasticsearch/index/seqno/ReplicationTracker.java#L482) to be tripped in some of our test runs.

This change adds two contexts to execute scripts against:

* SEARCH_SCRIPT: Allows running scripts in a search script context. This context is used in the `function_score` query's script function, script fields, script sorting and the `terms_set` query.
* FILTER_SCRIPT: Allows running scripts in a filter script context. This context is used in the `script` query.

In both contexts an index name and a sample document need to be specified. The document is needed to create an in-memory index that the script can access via the `doc[...]` and other notations. The index name is needed because a mapping is needed to index the document.

Examples:

```
POST /_scripts/painless/_execute
{
  "script": {
    "source": "doc['field'].value.length()"
  },
  "context" : {
    "search_script": {
      "document": {
        "field": "four"
      },
      "index": "my-index"
    }
  }
}
```

Returns:

```
{
  "result": 4
}
```

```
POST /_scripts/painless/_execute
{
  "script": {
    "source": "doc['field'].value.length() <= params.max_length",
    "params": {
      "max_length": 4
    }
  },
  "context" : {
    "filter_script": {
      "document": {
        "field": "four"
      },
      "index": "my-index"
    }
  }
}
```

Returns:

```
{
  "result": true
}
```

Also changed PainlessExecuteAction.TransportAction to use TransportSingleShardAction instead of HandledAction, because when the search or filter context is used the request now needs to be redirected to a node that has an active IndexService for the index being referenced (a node with a shard copy for that index).

Today it is unclear what guarantees are offered by the search preference feature, and we claim a guarantee that is stronger than what we really offer:

> A custom value will be used to guarantee that the same shards will be used
> for the same custom value.

This commit clarifies this documentation.

Forward-port of #32098 to `master`.

romseygeek · 2018-08-21T15:13:36Z

@nik9000 the fix for ScriptContexts in modules is now in, so I think this is ready for re-review.

nik9000

LGTM

nik9000 · 2018-09-04T19:42:34Z

...es/analysis-common/src/main/java/org/elasticsearch/analysis/common/CommonAnalysisPlugin.java

+    }
+
+    @Override
+    @SuppressWarnings("rawtypes")  // TODO ScriptPlugin needs to change this to pass precommit?


Did you mean to leave this TODO?

I think it's a backwards breaking change? ScriptPlugin#getContexts() needs to change return value from List<ScriptContext> to List<ScriptContext<?>>. Which ought to be done, but it's a separate change. I'll open an issue.

Opening an issue makes sense. Yeah, it is a separate thing.
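To make the raw-types point concrete, here is a small sketch using a hypothetical generic class, `SketchContext<T>`, as a stand-in for `ScriptContext`. It only contrasts the two return types mentioned above and does not reproduce the real `ScriptPlugin` interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generic context type.
final class SketchContext<T> {
    final String name;
    final Class<T> factoryClass;

    SketchContext(String name, Class<T> factoryClass) {
        this.name = name;
        this.factoryClass = factoryClass;
    }
}

final class RawTypesSketch {

    // Raw element type: compiles, but javac reports a rawtypes lint warning
    // unless it is suppressed, which is what the TODO in the hunk is about.
    @SuppressWarnings("rawtypes")
    static List<SketchContext> rawContexts() {
        List<SketchContext> contexts = new ArrayList<>();
        contexts.add(new SketchContext<>("analysis", Boolean.class));
        return contexts;
    }

    // Wildcard element type: no suppression needed, but it is a different
    // return type, so existing overrides declared with the raw type would
    // no longer compile against it.
    static List<SketchContext<?>> wildcardContexts() {
        List<SketchContext<?>> contexts = new ArrayList<>();
        contexts.add(new SketchContext<>("analysis", Boolean.class));
        return contexts;
    }

    public static void main(String[] args) {
        System.out.println(wildcardContexts().get(0).name); // prints analysis
    }
}
```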

This allows tokenfilters to be applied selectively, depending on the status of the current token in the tokenstream. The filter takes a scripted predicate, and only applies its subfilter when the predicate returns true.

* master:
  Fix deprecated setting specializations (elastic#33412)
  HLRC: split cluster request converters (elastic#33400)
  HLRC: Add ML get influencers API (elastic#33389)
  Add conditional token filter to elasticsearch (elastic#31958)
  Build: Merge xpack checkstyle config into core (elastic#33399)
  Disable IndexRecoveryIT.testRerouteRecovery.
  INGEST: Implement Drop Processor (elastic#32278)
  [ML] Add field stats to log structure finder (elastic#33351)
  Add interval response parameter to AutoDateInterval histogram (elastic#33254)
  MINOR+CORE: Remove Dead Methods ClusterService (elastic#33346)

romseygeek added 6 commits July 11, 2018 10:13

WIP

e5b20de

WIP

da0fd1e

WIP

df9bffc

WIP

402ed36

WIP

fb7c21d

docs

d8f0170

romseygeek self-assigned this Jul 11, 2018

romseygeek added >feature :Search Relevance/Analysis How text is split into tokens v6.4.0 labels Jul 11, 2018

romseygeek added the v7.0.0 label Jul 11, 2018

romseygeek requested review from nik9000 and jpountz July 11, 2018 11:10

romseygeek added 3 commits July 11, 2018 12:48

tests

bcee3f0

d'oh

bba5939

class name change in SPI

21cd02f

jpountz reviewed Jul 11, 2018

docs

4315682

nik9000 reviewed Jul 11, 2018

romseygeek and others added 8 commits July 13, 2018 13:13

Broekn

52955df

nuke unit test

dd139c7

Merge branch 'master' into scripted-analysis

1d5deff

feedback

57a73f2

Term -> Token; move ScriptContext into module

609951b

Re-instate link in StringFunctionUtils javadocs

801a704

The previous errors in compileJava were not cause by the brackets but my the content of the @link section. Corrected this so its a working javadoc link again.

Docs: Change formatting of Cloud options

f923d9c

Docs: Restyled cloud link in getting started

a21fb82

bleskes and others added 14 commits July 18, 2018 15:48

Add EC2 credential test for repository-s3 (#31918)

8924ac3

Add EC2 credential test for repository-s3 Relates to #26913

use before instead of onOrBefore

d225048

Fix BwC Tests looking for UUID Pre 6.4 (#32158)

f591095

* UUID field was added for #31791 and only went into 6.4 and 7.0 * Fixes #32119

Call setReferences() on custom referring tokenfilters in _analyze (#3…

67a4dcb

…2157) When building custom tokenfilters without an index in the _analyze endpoint, we need to ensure that referring filters are correctly built by calling their #setReferences() method Fixes #32154

Merge conflicts

93ecf1d

more docs

945fadf

tests for all script variables

303de4f

Merge branch 'master' into scripted-analysis

33b9da4

Merge branch 'master' into scripted-analysis

b775f71

merge error

1dff0f6

checkstyle

546aa11

romseygeek added v6.5.0 and removed v6.4.0 labels Aug 21, 2018

romseygeek added 2 commits August 21, 2018 13:06

headers

701fbf2

Use actual painless syntax, not my own made-up syntax

396843f

nik9000 approved these changes Sep 4, 2018

Merge branch 'master' into scripted-analysis

8bc3d65

romseygeek merged commit 6364427 into elastic:master Sep 5, 2018

romseygeek deleted the scripted-analysis branch September 5, 2018 13:52

romseygeek mentioned this pull request Sep 6, 2018

Scripted analysis components #26100

Closed

This was referenced Dec 13, 2018

[meta] 6.5.0 Release elastic/elasticsearch-net#3457

Closed

Add support for condition token filter elastic/elasticsearch-net#3523

Merged

jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
